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Abstract 

Phylogenetic analyses of DNA sequence data can provide estimates of evolutionary rates and timescales. Nearly all 
phylogenetic methods rely on accurate models of nucleotide substitution. A key feature of molecular evolution is the 
heterogeneity of substitution rates among sites, which Is often modelled using a discrete gamma distribution. A widely 
used derivative of this Is the gamma-Invariable mixture model, which assumes that a proportion of sites in the sequence are 
completely resistant to change, while substitution rates at the remaining sites are gamma-distributed. For data sampled at 
the intraspecific level, however, biological assumptions involved in the Invariable-sites model are commonly violated. We 
examined the use of these models in analyses of five intraspecific data sets. We show that using 6-10 rate categories for the 
discrete gamma distribution of rates among sites Is sufficient to provide a good approximation of the marginal likelihood. 
Increasing the number of gamma rate categories did not have a substantial effect on estimates of the substitution rate or 
coalescence time, unless rates varied strongly among sites in a non-gamma-distributed manner. The assumption of a 
proportion of invariable sites provided a better approximation of the asymptotic marginal likelihood when the number of 
gamma categories was small, but had minimal impact on estimates of rates and coalescence times. However, the estimated 
proportion of Invariable sites was highly susceptible to changes in the number of gamma rate categories. The concurrent 
use of gamma and invariable-site models for Intraspecific data is not biologically meaningful and has been challenged on 
statistical grounds; here we have found that the assumption of a proportion of Invariable sites has no obvious impact on 
Bayeslan estimates of rates and timescales from intraspecific data. 
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Introduction 

In phylogenetic analyses of DNA sequence data, the evolution- 
ary process is usually described using models of nucleotide 
substitution. These models commonly assume that substitutions 
occurring at each site are described by a Markov chain and that 
difiFerent sites evolve in a mutually independent manner. In 
practice, almost all models of nucleotide substitution are time- 
reversible. The general time-reversible (GTR) model, formally 
described by Tavare [1], includes parameters that allow unequal 
frequencies for the four nucleotides and a distinct rate for each of 
the six pairwise nucleotide substitutions. By constraining one or 
more of these parameters, a family of time-reversible models can 
be generated. This gives rise to special cases of the GTR model, 
such as the Jukes-Cantor [2], Kimura [3], and Hasegawa- 
Kishino— Yano [4] models. 

In their basic form, nucleotide substitution models assume that 
the evolutionary process is homogeneous across sites. In reality, 
however, rates of mutation can vary among sites because of 
selective pressures associated with structural and functional 
constraints [5-7]. Some sites, such as CpG islands in mammalian 
taxa, have a higher propensity to mutate [8] . Failure to take into 
account this rate heterogeneity among sites (RHAS) can lead to 



biased estimation of branch lengths, with corresponding impacts 
on estimates of phylogenies, substitution rates, and evolutionary 
timescales [9-13]. 

RHAS can be modelled in a number of ways, but the most 
popular approach is to assume that the rate at each site is a 
random variable drawn from a statistical distribution. The gamma 
distribution (H-F) is most commonly used for this purpose [14,15], 
owing to its ease of interpretation and its good fit to empirical data 
[5] . The shape of the gamma distribution, governed by the shape 
parameter a, can range from bell-shaped (ot>l) to L-shaped (a< 
1). Consequently, the gamma distribution is capable of modelling 
various degrees of RHAS [5]. 

To reduce computational burden, most methods employ a 
discrete gamma model in which the continuous distribution is 
approximated by several rate classes with equal percentiles and 
probabilities [14]. Within each class, all of the rates are 
represented by the mean or median. The higher the number of 
rate categories, the better is the fit of the discrete gamma 
distribution to the continuous gamma distribution. The number of 
rate categories (k) used in phylogenetic analyses of nucleotide 
sequences generally ranges from 2 to 32 [16-18], with most 
analyses employing 4—10 categories. In an analysis of two small 
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data sets (4 sequences, 1352 sites and 5 sequences, 570 sites), Yang 
[14] found that 4 rate categories provided sufficiently good 
approximations of Q( and the likelihood and that there was little 
improvement in estimation accuracy when more than 8 categories 
were used. A greater number of rate categories might be 
preferable for large data sets [19]. 

RHAS can also be described using discrete rate-class models, in 
which the rate categories are not determined from some 
underlying distribution. A special case of the discrete-rates model 
is the invariable-sites model (+1), which assumes that there is a rate 
class with a rate of 0 [4,20-23]. In this model, some proportion of 
sites (pinv) is assumed to be invariable, or completely resistant to 
change, because mutations at these sites are strongly deleterious or 
fatal. The remaining sites are assumed to evolve at non-zero rates. 
The idea of invariable sites was inspired by studies of protein 
structures and has intuitive appeal [20,24]. The invariable-sites 
model can be combined with the gamma model of RHAS (+r+I) 
[25,26], and the resulting gamma-invariable mixture models are 
widely used. 

When performing a phylogenetic analysis, the inclusion of a 
model of RHAS is usually determined using a model-selection 
approach. A common practice is to compare a range of 
substitution models using a model-selection criterion, such as the 
Akaike information criterion [27] or Bayesian information 
criterion [28]. In this respect, the Bayesian information criterion 
has been shown to perform well across a range of simulation 
scenarios [29]. The best-fitting model is then used in subsequent 
analyses of the data set. 

Evolutionary models chosen using objective criteria are not 
always the most biologically pertinent [30], which raises questions 
about the meaning of the resulting estimates of parameters. A 
prominent example is the +1 model, which is sometimes selected as 
the best-fitting RHAS model for data that have been sampled at 
the population level. Intraspecific data tend to display lower levels 
of variation than sequences that have been sampled at the species 
level and above. This makes the evaluation oipj^^^ highly sensitive 
to taxon sampling [5] . We would expect the number of constant 
sites to decline as sample size increases, leading to lower estimates 
of jSinv Sites observed to be constant among sequences might not 
be invariable, but might simply have not experienced any 
mutations among the sequences that have been sampled. 
Additionally, deleterious mutations are much more likely to be 
found in population-level data [31,32], so that sites that would 
t)iMcally be treated as 'invariable' at the phylogenetic level might 
contain transient polymorphisms at the population level. 

The +r+I mixture model, first used by Gu et al. [25], has been 
criticised on the grounds that the two parameters involved - pi^^ 
and a - cannot be optimised independentiy of each other 
[15,19,33,34]. An L-shaped gamma distribution (a<l) already 
accommodates a proportion of low-variabilit)' sites; as a conse- 
quence, adding a parameter to account for invariable sites creates 
a strong c()rr(;lation betwe(;n and a This might cause 

considerable problems during the parameter optimisation process, 
since it is impossible to obtain reliable estimates of both 
parameters simultaneously [33]. Combining this with the afore- 
mentioned sensitivity of/);,iv to the size of the data set and the level 
of divergence, applying the +r+I model to intraspecific data sets 
appears particularly problematic, at least on theoretical grounds. 

The impact of using different RHAS models in phylogenetic 
analyses of intraspecific data is not well understood. The impact of 
the choice of RHAS model is most likely to be seen in estimates of 
branch lengths, which can have subsequent effects on the inferred 
tree topology. Here we investigate how varying the RHAS model 
affects phylogenetic analyses based on the molecular clock. We 



focus on five intraspecific data sets, four of which comprise 
heterochronous sequences (ancient DNA and viruses). In heter- 
ochronous data set, the ages of the sequences provide internal 
calibrations for the molecular clock, making such data ideal for 
studying evolutionary rates and timescales at the intraspecific level. 
We test whether increasing the number of gamma categories or 
assuming a proportion of invariable sites affects estimates of 
substitution rates and coalescence times. 

Materials and Methods 

We assembled five intraspecific data sets: (i) complete mitogen- 
omes from human haplogroup CI, (ii) complete mitogenomes 
from hominins, (iii) mitochondrial D-loop from muskox, (iv) PB2 
gene from HlNl human influenza virus, and (v) concatenated 
genome fragment from HIV-l. The first data set comprised 
isochronous sequences from the present day, whereas the last four 
data sets comprised heterochronous sequences of known ages. 
Details of the datasets are hsted in Table 1 . 

For each data set, sequence alignments were done using 
MUSCLE 3.8.31 [35] and adjusted manually. Sequences with 
uncertain ages were removed. For each data set, we performed 
preliminary Bayesian phylogenetic analyses using the best-fitting 
substitution models selected by the Bayesian information criterion. 
Alignments for aU datasets used in this study are available in File 
SI. Bayesian phylogenetic analyses of the data sets were performed 
in BEAST vl.7.5 [36]. The best-fitting model of nucleotide 
substitution was selected for each data set using the Bayesian 
information criterion in Modelgenerator 0.85 [30]. We tested four 
different RHAS models: equal rates among sites, +T, +\, and -t-F-l- 
I. For the +r models, we repeated the analysis using various 
numbers of rate categories ranging from 3-32. To test for rate 
heterogeneity among lineages, we initially used the uncorrelated 
lognormal relaxed clock [37]. For the human mitogenome data 
sets (i and ii), the coefficient of rate variation did not provide any 
evidence of rate variation among lineages. Marginal likelihoods 
were estimated using the harmonic-mean estimator [38] . 

Using the maximum-clade-credibility trees from our Bayesian 
analyses, we inferred the number of substitutions at each site using 
stochastic mutational mapping in SIMMAP [39,40]. We used the 
empirical nucleotide frequencies and applied the mean rate 
estimate from the Bayesian phylogenetic analysis of each data 
set to scale the branch lengths. Numbers were rounded to the 
nearest integer. The distribution of inferred mutational counts 
provided a picture of the RHAS pattern for each data set. We 
performed chi-squared tests to compare the goodness-of-fit of 
gamma and negative binomial distributions to the site-specific 
substitution counts. In the absence of RHAS, the distribution of 
site-specific substitution counts should conform to the Poisson 
distribution. In the presence of gamma-distributed RHAS, we 
expect the number of changes per site to conform to the negative 
binomial distribution. We simultaneously estimated the values of a 
by minimising y^. We compared the fit of these two distributions 
using the Akaike information criterion. 

In the analyses of the heterochronous sequences (data sets ii to 
v), the molecular clock was calibrated using the ages of the 
sequences. Following previous studies of viruses [41] and ancient 
DNA [42], we used date-randomisation tests to check that the 
spread and structure of the sequence ages were sufficient for 
calibrating estimates of substitution rates (Figure SI). This test 
involves the random reassignment of the dates to the sequences. If 
the mean posterior rate estimated from the original data set is 
included in any of the 95% credibility intervals of the rates 
estimated from the date-randomised replicates, the sequence ages 
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are considered to have insufficient structure and spread for 
calibrating the molecular clock. For the mitogenomes from human 
haplogroup CI (data set i), we assumed an age of 21,700 years 
(standard deviation 2700 years) for the root of the tree, following 
the study by Kumar et al. [43] . 

A constant-size coalescent prior was used for each analysis, 
based on Bayes factors calculated from the marginal likelihoods of 
each +r model. Posterior distributions of parameters, including 
the tree, were estimated by Markov chain Monte Carlo (MCMC) 
sampling. MCMC chains were run for at least 2xl0' steps, with 
samples drawn every lO'^ steps. After discarding an appropriate 
proportion of the samples as burn-in, we checked for acceptable 
sampling, mixing, and convergence to the stationary distribution 
in each case. 

Results 

Analysis of rate variation among sites 

All data sets showed a negative relationship between the 
number of sites in a rate category and the number of inferred 
substitutions per site. The category of zero inferred changes had 
the highest count for all data sets and, with the exception of the 
HIV data set, most sites were observed to be constant for each 
data set (Figure 1). For all data sets, the Poisson distribution 
provided a poor fit to the site-specific substitution counts (P<< 
0.001), with the negative binomial distribution providing a 
significandy better fit (Table 2). Despite this improvement, neither 
model provided a good approximation of the pattern of RHAS in 
the two human mitogenome data sets (i and ii), which were the 
largest and least variable. The HIV data set (v) features a large 
proportion of sites inferred to be non-constant, and shows 
considerable deviation from both the Poisson and negative 
binomial distributions. There was a particularly large proportion 
of sites with high inferred numbers of substitutions. 

Effect of RHAS model on estimates of likelihoods and 
parameters 

For four out of the five data sets, the marginal likelihood 
increased with the number of gamma rate categories (Figure 2). In 
general, 6-10 rate categories provided a good approximation of 
the asymptotic likelihood value for +r models, whereas there was 
minimal improvement in likelihood when using greater than 10 
gamma rate categories. It is noticeable that marginal likelihood is 
generally higher with +r+I models than with +r models, and that 
there is more rapid convergence towards the asymptotic value in + 
r+I models. In the analyses of the four heterochronous data sets 
(ii-v), neither varying the number of gamma rate categories nor 
allowing a proportion of invariable sites had any obvious impact 
on estimates of the coalescence time (root age) or the substitution 
rate (Figure 3). In almost all cases, the 95% credibility intervals 
overlapped substantially and the variance in means was <6% of 
the average 95% CI width (except for the HIV data set). The only 
exception was the rate estimate for the HIV data set (v), which 
slightly decreased with an increasing number of gamma rate 
categories. In the analysis of the mitogenome sequences from 
human haplogroups C 1 (data set i), neither varying the number of 
gamma categories nor allowing a proportion of invariable sites had 
any noticeable effect on the estimate of the substitution rate 
(Figure S2). 

The relationships between the ot shape parameter or and the 
number of gamma rate categories were less clear (Figure 2). The 
general pattern seemed to be a negative relationship between both 
parameters and the number of gamma rate categories, which can 
best seen in the declining posterior means and the non- 



3 



May 2014 | Volume 9 | Issue 5 | e95722 



Impact of RHAS Model on Phylogenetic Estimates 



(i) Human hg C1 mitogenomes 




100000 

m 

OJ 10000 
1000 

o 

1_ 100 
10 

^ 1 

0 



Number of substitutions at site 



(ii) Hominin mitogenomes 

rtrtlTTTr^ 



01234567 



Number of substitutions at site 



(iii) Muskox D-loop 



ITTTmmi 



Number of substitutions at site 



(iv) H1N1 PB2 




Number of substitutions at site 



(v) HIV partial genome 



01 



E 




6 8 10 12 14 16 18 20 22 24 26 28 30 34 36 38 

Number of substitutions at site 



Figure 1. Semi-logarithmic plots of substitutions per site for 
five intraspecific DNA data sets. Number of substitutions at each 
site were inferred using parsimony on the Bayesian estimates of the 
tree topologies. Columns indicate the number of sites against the 



number of substitutions occurring at each site. Red and green lines 
indicate the best-fitting Poisson and negative binomial distributions, 
respectively. 

doi:10.1371/journal.pone.0095722.g001 

overlapping 95% CIs of estimates based on small and large 
numbers of gamma categories. However, exceptions to both of 
these rules were observed in the sequences from human influenza 
A HlNl virus (data set iv). 

Increasing the number of gamma rate categories led to a linear 
increase in computation time (Figure 4). The computational costs 
associated with an increase in the number of gamma categories 
were similar between +r and +r+I models. With regard to the 
topologies of the maximum-clade-credibility trees, we observed 
that the major clades were unaffected by changes in the number of 
gamma categories or the inclusion of invariable sites (results not 
shown). Since this study is limited to data at the population level, 
which are described by the coalescent process, we did not 
investigate the impacts of different models on detailed phyloge- 
netic relationships. 

Discussion 

Our phylogenetic analyses of five intraspecific data sets have 
provided a number of insights into the performance of RHAS 
models. Varying the RHAS model, including the number of 
gamma rate categories or a proportion of invariable sites {pi„^), had 
neghgible impacts on our estimates of root age and substitution 
rate from heterochronous data sets. We found evidence of a 
complex interplay between a, the number of gamma rate 
categories, and When a proportion of the sites were assumed 
to be invariable, increasing the number of gamma rate categories 
generally caused a decrease in both ot and pi^v This is because the 
sites that are changing rapidly (mutational hotspots) are preferen- 
tially accommodated over less variable sites when there are few 
rate categories. When there is a limited number of rate categories, 
this results in a gamma distribution with a higher a. The presence 
of the invariable-sites parameter in this situation mitigates this 
bias, but results in an overestimation of 

Our results highlight a trade-off between computational cost 
and improved accuracy when an increasing number of gamma 
rate categories are used to model RHAS. The marginal likelihood, 
which describes the average fit of a model to the data, reaches a 
plateau as the number of gamma rate categories increases, a result 
that echoes those of Yang [14]. Using a large number of 
categories, however, incurs a significant computational cost, 
increasing both the RAM and time requirements for the likelihood 
calculations [44]. Our results suggest that using 6-10 rate 
categories provides a good approximation of the plateau likelihood 
value when not using invariable-sites models, and that increasing 
the number of rate categories incurs greater computational cost 
with minimal benefit. This contradicts suggestions that using a 
relatively small number of rate categories is insufficient to capture 
the complexities of the molecular evolutionary process [17]. 

For population-level analyses that aim to estimate substitution 
rates or coalescence times, our results suggest that gready 
increasing the number of gamma categories typically does not 
lead to substantial changes in parameter estimates. An exception is 
when the evolutionary rate is highly variable among sites and 
deviates strongly from a gamma distribution, as in the case of the 
HIV data set examined here. In such cases, a higher number of 
rate categories (8-10) might lead to a modest improvement in 
estimation accuracy. Surprisingly, the HlNl influenza virus data 
did not show a positive relationship between marginal likelihood 
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Table 2. Fit of Poisson and negative binomial distributions to the site-specific substitution counts in Figured 1, estimates of ot, the 
ratio of the P values for both distributions, the proportion of constant sites for five data sets and the substitution rate estimate for 
each data set. 



Number of 



Mean 

substitution 
rate (site^ 





Poisson 


Negative binomial 




zlAIC 


constant sites'* 


year 


(i) Human lig C1 
mitogenomes 


4.22 X 10-* 


Zi = 58.63 


0.048 
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96.8% 


4.22 X 10-* 




P = 3.73xlO-'*2 


P = 2.33x 10-" 










(ii) Hominin 
mitogenomes 


xl=4S\.2l 


Zi = 76.44 


0.092 


2617.6 


93.2% 


2.04 X 10-* 




P = 5.62x 10-"" 


P = 1.94x 10-" 










(iii) Muskox 


= 626.76 


Zi = 84.22 


0.24 


829.0 


69.5% 


9.51 X 10-' 




P = 3.3x 10-153 


P=1.89x 10-" 










(iv) H1N1 


xl = n93.ii 


110 = 153.95 


0.31 


1622.0 


63.6% 


1.75 X 10-^ 




p«io-™ 


P = 5.72x 10-2* 










(v) HIV 


X^, =3860.99 


Zjg =475.95 


0.39 


1 3845.0 


27.5% 


3.05 X 10-' 




p«io--™ 


P = 4.50x 10-" 











^This estimate of the shape parameter for gamma-distributed rates among sites was obtained by minimising x^- 
i^Sites having <0.5 mutations as inferred by stochastic mutational mapping in SIMMAP. 
doi:l 0.1 371 /journal.pone.0095722.t002 



and the number of rate categories, although the reasons for this 
are unclear. 

The results of our analyses using the invariable-sites model are 
more complex. The -t-F-l-I model is intuitively appealing and is, 
according to the Bayesian information criterion, the preferred 
model for four of the five data sets analysed (Table 1). However, 
estimates of parameters in the H-F-l-I model are highly sensitive to 
taxon sampling [5,34] and there is a strong correlation between 
the proportion of invariable sites and the gamma shape parameter 



[34]. Our analyses reveal that estimates of the proportion of 
invariable sites are also highly susceptible to changes in the 
number of rate categories. Our results corroborate the notion that 
estimates of both parameters are inevitably biased when the -t-F-l-I 
model is used [34]. 

The interdependence of and a in the -l-F-l-I model has led 
some researchers to warn against its use [33,45]. For example, in 
the justification of their exclusion of-l-F-l-I models in their analyses, 
Ren et al. [45] contended that one should also consider the 
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Figure 2. Bayesian phylogenetic estimates of various parameters against number of gamma rate categories for five intraspecific 
DNA data sets. Fronn top to bottom, rows show estimates of marginal likelihood, the gamma shape parameter (a), and the proportion of invariable 
sites (pinv). Filled blue and empty red markers represent parameter estimates using +r and +r+\ models, respectively. 
doi:10.1371/journal.pone.0095722.g002 
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Figure 3. Bayesian phylogenetic estimates of substitution rate and root age against number of gamma rate categories for four 
intraspecific DNA data sets. Filled blue and empty red markers represent parameter estimates using +r and +r+l models, respectively. 
doi:1 0.1 371 /journal.pone.0095722.g003 



"biological interpretations of the models and the robustness of 
analysis to model assumptions" (p. 815), not just the fit of the 
model to the data. We concur with this viewpoint, considering that 
the invariable-sites assumption is particularly troublesome for 
intraspecific data. Here we would expect a +r+I model to perform 
well for data sets that show a bimodal distribution in site-specific 
rates (one peak at 0 and one peak at >0). However, when the aim 
is to estimate the substitution rate or coalescence times for a 
population-level data set, we found that allowing a proportion of 
invariable sites did not alter the results substantially. There was, 
however, a small computational benefit associated with +r+I 
models, since they generally outperformed +T models in marginal 
likelihood at small numbers of gamma categories, reducing the 
need for higher numbers of categories in +r models. In some 
instances (data set i and ii), the asymptotic marginal likelihood for 
+r+I models was slightly higher than the marginal likelihood for + 
r models. 



Using a gamma distribution to model RHAS has deservedly 
been popular, owing to its good fit and mathematical simplicity 
[46]. There is, however, no reason to believe that the distribution 
of rates among sites actually follows the gamma distribution. 
Indeed, we observe that the gamma model stiU does not provide an 
accurate picture of RHAS, especially for the least variable data sets 
that we examined (i and ii). Some studies have explored the 
possibility of using alternative approaches to model RHAS. 
Notably, the discrete-rates CAT model, implemented in RAxML 
[47] and FastTree2 [44], has been shown to be computationally 
more efficient than the traditional +T model and yields tree 
topologies with improved likelihood values [48]. Recently, Wu et 
al. [49] proposed a Bayesian method of automatic model selection 
that simultaneously estimates the substitution model and rate at 
each site. The performance of these parameter-rich models, with 
regard to phylogenetic analyses of intraspecific sequence data, 
warrants further study. 



120 n 




0 5 10 15 20 25 30 35 



Number of gamma categories 



Figure 4. Computation time (min) as a function of the number of gamma rate categories, for the hominin mitogenome data set (ii). 

Filled blue and empty red markers represent computation time using TrN+F and TrN+l+F models, respectively. Each Markov chain was run for 10* 
steps on a six-core processor (Intel Xeon W3690). 
doi:1 0.1 371/journal.pone.0095722.g004 



PLOS ONE I www.plosone.org 



6 



May 2014 | Volume 9 | Issue 5 | e95722 



Impact of RHAS Model on Phylogenetic Estimates 



Conclusions 

The inclusion of a parameter for the proportion of invariable 
sites is a legacy of studies conducted at the interspecific level. At 
the intraspecific level, estimating the proportion of invariable sites 
is primarily a statistical measure that does not have much 
biological meaning. Here we suggest that most intraspecific studies 
of substitution rates or coalescence times can use fi-wcr than 10 
gamma rate categories, in order to achieve a balance between 
model complexity, computational efficiency, and parameter 
estimation. The results of our study apply to population-level 
data, but are probably relevant to data sets containing sequences 
from closely related species. Further studies of rate variation in 
interspecific data sets will provide additional insights into the 
performance of RHAS models. 

Supporting information 

Figure SI Results of the date-randomisation test for 
temporal signal in heterochronous data. The plot shows 
the rate estimates and their 95% credibility intervals for the 
unrandomised data (empty markers)_and date-randomised repli- 
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